Hello World under Uncertainty!¶

In this notebook, we use the FoSRL library to train a safe policy in an uncertain control-affine system.

We consider a linear dynamical system with additive uncertainty: $$ \dot{x} = A \, x + B \, u + z $$

where $x = [p_x, p_y]$ is the position, $u = [v_x, v_y]$ the velocity command, and $z = [z_x, z_y]$ the additive disturbance.
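For the single-integrator system used below, $A = 0$ and $B = I$, so the dynamics reduce to $\dot{x} = u + z$. As a quick intuition check, here is a forward-Euler step of these dynamics in plain NumPy (our own sketch, not FoSRL code):

```python
import numpy as np

def euler_step(x, u, z, dt=0.1):
    """One forward-Euler step of the single integrator: x_dot = u + z."""
    return x + dt * (u + z)

x = np.array([0.0, 0.0])   # position [px, py]
u = np.array([1.0, 0.0])   # commanded velocity [vx, vy]
z = np.array([0.0, 0.5])   # additive disturbance [zx, zy]
x_next = euler_step(x, u, z)
# x_next == [0.1, 0.05]: the disturbance pushes the state off the commanded direction
```

The disturbance $z$ is unknown at control time; the robust CBF machinery below accounts for its worst case over the bounded uncertainty set.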

[Table: types of uncertainty]

Among many types of uncertainty, we demonstrate the use of FoSRL on additive bounded uncertainty.

As already shown in the previous tutorial, FoSRL operates in two distinct phases:

  1. An offline counter-example guided pretraining phase, in which FoSRL builds a Control Barrier Function (CBF) that keeps the system within the set of safe states.
  2. An online safe interactive learning phase, in which FoSRL iteratively (and safely) explores the environment, collects the reward signal, and updates the policy to maximize the collected reward.

The changes needed to cope with uncertainty mainly affect the first phase; the second remains the same as in the previous tutorial.

In [1]:
# Import the necessary libraries
import numpy as np
import torch
from fosco.systems import make_system
from fosco.systems.uncertainty import add_uncertainty
from fosco.config import CegisConfig
from fosco.cegis import Cegis

# plotting utilities
from fosco.plotting.domains import plot_domain
from fosco.plotting.constants import DOMAIN_COLORS
from plotly.graph_objs import Figure

%matplotlib widget
In [2]:
seed = 42
verbosity = 1

Offline Counter-example guided Pretraining¶

Define the symbolic assumptions on the dynamics and uncertainty¶

To enable the counter-example guided pretraining, we need to define a set of symbolic assumptions on the environment dynamics that we use to learn and verify the CBF.

For all systems, the assumptions consist of a characterization of the state and input spaces, together with sub-domains for initial, unsafe, and other states. Moreover, since we consider an uncertain system, we also need to define the uncertainty domain.

Each domain must be a symbolic set. We offer several implementations of common multi-dimensional sets, such as:

  • Rectangle, Sphere,
  • Union, Intersection, Complement,
  • and others.
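Conceptually, each of these symbolic sets supports membership checks and sampling. A minimal sketch of the idea, with hypothetical `Rectangle` and `Sphere` classes (not the FoSRL implementations, which additionally carry symbolic constraints for the verifier):

```python
import numpy as np

class Rectangle:
    """Axis-aligned box defined by lower and upper bounds."""
    def __init__(self, lb, ub):
        self.lb, self.ub = np.asarray(lb), np.asarray(ub)

    def contains(self, x):
        return bool(np.all(x >= self.lb) and np.all(x <= self.ub))

    def generate_data(self, n):
        # uniform samples inside the box, shape (n, dim)
        return np.random.uniform(self.lb, self.ub, size=(n, len(self.lb)))

class Sphere:
    """Ball defined by center and radius."""
    def __init__(self, center, radius):
        self.center, self.radius = np.asarray(center), radius

    def contains(self, x):
        return bool(np.linalg.norm(x - self.center) <= self.radius)

state_space = Rectangle((-5.0, -5.0), (5.0, 5.0))
obstacle = Sphere((0.0, 0.0), 1.0)
assert state_space.contains(np.array([1.0, 2.0]))
assert not obstacle.contains(np.array([3.0, 0.0]))
```

Compound sets such as `Union`, `Intersection`, and `Complement` combine these primitives, e.g. the `init` domain below is the complement of a rectangle.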
In [3]:
# define control-affine dynamical system
system_id = "SingleIntegrator"
uncertainty_type = "AdditiveBounded"
system = make_system(system_id)()
system = add_uncertainty(uncertainty_type, system=system)
print(type(system))
<class 'fosco.systems.uncertainty.additive_bounded.AdditiveBounded'>
In [4]:
# define state domains and input domains
# and initial, unsafe, lie state domains
domains = system.domains

print("Domains: ", list(domains.keys()), "\n")
for k, dom in domains.items():
    print(f"{k}: {dom}")

# Visualization of state domains
fig = Figure()
for dname, domain in domains.items():
    if dname in DOMAIN_COLORS:
        color = DOMAIN_COLORS[dname]
        fig = plot_domain(domain, fig, color=color, label=dname)
fig.update_traces(showlegend=True)

fig.show()
Domains:  ['lie', 'input', 'init', 'unsafe', 'uncertainty'] 

lie: Rectangle((-5.0, -5.0), (5.0, 5.0))
input: Rectangle((-5.0, -5.0), (5.0, 5.0))
init: Complement(Rectangle((-4.0, -4.0), (4.0, 4.0)))
unsafe: Sphere((0.0, 0.0), 1.0)
uncertainty: Sphere((0.0, 0.0), 1.0)

For this system, there are now five domains:

  • input: the domain of control actions;
  • uncertainty: the domain of the additive disturbance z;
  • init: the domain of initial states, on which we want the CBF to be positive;
  • unsafe: the domain of unsafe states, on which we want the CBF to be negative;
  • lie: the domain of all states, on which we enforce the CBF condition on the Lie derivative.
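To make these conditions concrete, here is a toy check of the three requirements on a hand-picked candidate barrier for this environment (a hypothetical quadratic barrier of our own choosing, not the learned one):

```python
import numpy as np

def barrier(x):
    # hypothetical candidate: positive outside the unit obstacle, negative inside
    return float(x @ x - 1.0)

def lie_derivative(x, u, z):
    # B_dot = grad(B) . x_dot = 2 x . (u + z) for the single integrator
    return float(2.0 * x @ (u + z))

alpha = 1.0  # gain of a linear class-K function

x_init = np.array([4.5, 4.5])     # state in the init region
x_unsafe = np.array([0.2, 0.3])   # state inside the obstacle
assert barrier(x_init) > 0        # positive on init states
assert barrier(x_unsafe) < 0      # negative on unsafe states

# CBF condition at one state, for a given action and disturbance:
x, u, z = np.array([2.0, 0.0]), np.array([1.0, 0.0]), np.array([-0.5, 0.0])
assert lie_derivative(x, u, z) + alpha * barrier(x) >= 0
```

The verifier checks these conditions symbolically over the whole domains, not just at sampled points; for the robust variant, the Lie condition must hold for all disturbances in the uncertainty domain.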

Define the numerical data¶

Having defined the symbolic expressions used to verify the CBF, we now need their numerical counterparts as training data for learning.

Here, for each of these domains, we define a dataset of samples:

  • init: a dataset of initial states;
  • unsafe: a dataset of unsafe states;
  • lie: a dataset of (state, action, uncertainty) tuples;
  • uncertainty: a dataset of (state, action, uncertainty) tuples for the robustness condition.

Note: we do not define the training data directly, because the tool expects a generator function for each dataset.

In [5]:
# data generator
from fosco.common.consts import DomainName as dn

data_gen = {
    'init': lambda n: domains[dn.XI.value].generate_data(n),
    'unsafe': lambda n: domains[dn.XU.value].generate_data(n),
    'lie': lambda n: torch.concatenate([
        domains["lie"].generate_data(n), 
        domains["input"].generate_data(n),
        domains["uncertainty"].generate_data(n),
    ], dim=1
    ),
    'uncertainty': lambda n: torch.concatenate([
        domains["lie"].generate_data(n),
        domains["input"].generate_data(n),
        domains["uncertainty"].generate_data(n),
    ],
    dim=1,
    )
}

Define the configuration¶

It remains to define the configuration of the pretraining and its hyper-parameters.

There are two important changes to the configuration:

  • the CERTIFICATE is now set to rcbf to indicate we are looking for a Robust CBF;
  • the LOSS_WEIGHTS include the uncertainty loss term and two additional regularization terms.
In [6]:
config = CegisConfig(
    SEED=seed,              # the seed for reproducibility
    CERTIFICATE="rcbf",      # the type of certificate, either cbf or rcbf
    VERIFIER="z3",          # the type of verifier, either z3 or dreal
    ACTIVATION=["htanh"],   # the activation of the i-th hidden layer
    N_HIDDEN_NEURONS=[20],  # the nr of neurons of the i-th hidden layer
    CEGIS_MAX_ITERS=20,     # the maximum number of iterations
    N_DATA=5000,            # the nr of samples in each training dataset
    RESAMPLING_N=100,       # the nr of points to sample around each counter-example
    RESAMPLING_STDDEV=0.1,  # the std deviation to sample around each counter-example    
    LOSS_WEIGHTS={          # the weights for each loss term
        'init': 1.0, 
        'unsafe': 1.0, 
        'lie': 1.0,
        'robust': 1.0,
        'conservative_b': 1.0,
        'conservative_sigma': 0.1        
    },
)

Let us briefly comment on the loss weights.

For finding a candidate CBF, we minimize the following loss

$$ \mathcal{L} = \lambda_{init} \, \mathcal{L}_{init} + \lambda_{unsafe} \, \mathcal{L}_{unsafe} + \lambda_{lie} \, \mathcal{L}_{lie} + \lambda_{robust} \, \mathcal{L}_{robust} + \lambda_{reg-b} \, \mathcal{L}_{reg-b} + \lambda_{reg-s} \, \mathcal{L}_{reg-s} $$

where:

  • $\mathcal{L}_{init}$ penalizes counter-examples in the dataset of initial states;
  • $\mathcal{L}_{unsafe}$ penalizes counter-examples in the dataset of unsafe states;
  • $\mathcal{L}_{lie}$ penalizes counter-examples in the lie dataset;
  • $\mathcal{L}_{robust}$ penalizes counter-examples in the uncertainty dataset;
  • $\mathcal{L}_{reg-b}$ penalizes states where the CBF is negative, to discourage overly conservative barriers;
  • $\mathcal{L}_{reg-s}$ penalizes states where the compensator is positive, to discourage overly conservative compensators.
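As an illustration, counter-example losses of this kind are typically hinge-style penalties that fire only where a condition is violated. A minimal sketch of the init and unsafe terms in PyTorch (our own simplification, not the FoSRL loss code; the margin value is an assumption):

```python
import torch

margin = 0.1  # assumed slack margin around the decision boundary

def hinge(violation):
    # penalize only samples where the condition is violated (plus a margin)
    return torch.relu(violation + margin).mean()

# barrier values B(x) on sampled init / unsafe states
B_init = torch.tensor([0.5, -0.2, 1.0])
B_unsafe = torch.tensor([-0.3, 0.4])

loss_init = hinge(-B_init)      # want B > 0 on init states
loss_unsafe = hinge(B_unsafe)   # want B < 0 on unsafe states
total = 1.0 * loss_init + 1.0 * loss_unsafe  # weighted as in LOSS_WEIGHTS
# total: 0.1 + 0.25 = 0.35
```

The lie and robust terms follow the same pattern, with the violation given by the (worst-case) Lie-derivative condition instead of the sign of the barrier.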
In [7]:
from fosco.plotting.functions import plot_torch_function

cegis = Cegis(
    system=system,
    domains=domains,
    config=config,
    data_gen=data_gen,
    verbose=verbosity
)

result = cegis.solve()
INFO:fosco.cegis:Seed: 42
INFO:fosco.cegis:Iteration 1
INFO:fosco.verifier.verifier:init: Counterexample Found: [x0, x1] = tensor([-4.6153,  4.4458]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5101, unsafe: 5000, lie: 5000, uncertainty: 5000
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 2
INFO:fosco.verifier.verifier:init: Counterexample Found: [x0, x1] = tensor([-4.3367,  4.8890]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5000
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 3
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-2.1406, -0.7266, -4.0000, -4.0000,  0.4297,  0.9026]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5101
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 4
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-0.8906,  1.2500, -4.0000,  3.0000,  0.0000, -1.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5202
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 5
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-2.0000,  0.5000, -4.0000, -4.0000,  1.0000,  0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5303
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 6
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1., -1., -4., -4.,  1.,  0.]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5404
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 7
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.,  1., -4., -2.,  1.,  0.]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5505
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 8
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1., -1., -4., -4.,  1.,  0.]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5606
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 9
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-0.5000, -1.3750, -4.0000, -4.0000,  1.0000,  0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5707
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 10
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.0000,  1.1250, -4.0000, -3.0000,  1.0000,  0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5808
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 11
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.0000,  1.1250, -4.0000, -3.0000,  1.0000,  0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5909
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 12
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.0000,  1.1250, -4.0000, -3.0000,  1.0000,  0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6010
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 13
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.0000,  1.1250, -4.0000, -3.0000,  1.0000,  0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6111
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 14
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.5000,  0.0000, -4.0000, -4.0000,  1.0000,  0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6212
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 15
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.5000,  0.0000, -4.0000, -4.0000,  1.0000,  0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6313
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 16
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.5625,  0.0000, -4.0000, -4.0000,  1.0000,  0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6414
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 17
INFO:fosco.verifier.verifier:No counterexamples found!
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6414
INFO:fosco.cegis:CEG Pretraining finished after 17 iterations
In [10]:
import plotly.graph_objects as go

fig = go.Figure(layout=dict(width=1000, height=1000))
fig = plot_torch_function(
    function=result.barrier, 
    domains=system.domains,
    fig=fig,
)
fig.show()

fig = go.Figure(layout=dict(width=1000, height=1000))
fig = plot_torch_function(
    function=result.compensator, 
    domains=system.domains,
    fig=fig,
)
fig.show()

Online Safe Interactive Learning¶

Starting from the Robust CBF found in the previous phase, we can now proceed with the policy training.

Gymnasium Wrapper¶

As is common in RL libraries, we adopt the gymnasium API to simulate the system.

To make the continuous-time CBF formulation work in the discretized simulation, we use a small time step.
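Why the small time step matters: with a coarse dt, a discrete trajectory can "tunnel" through the unsafe set between two consecutive checked states, so the continuous-time safety guarantee degrades. A toy illustration in plain NumPy (our own example, undisturbed dynamics for simplicity):

```python
import numpy as np

def simulate(x0, u, dt, steps):
    # discretized single integrator: x_{k+1} = x_k + dt * u (no disturbance)
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * u)
    return np.array(xs)

unsafe_radius = 1.0
x0, u = np.array([-2.0, 0.0]), np.array([4.0, 0.0])

coarse = simulate(x0, u, dt=1.0, steps=1)  # one big step, same total motion
fine = simulate(x0, u, dt=0.1, steps=10)   # ten small steps

# the coarse trajectory never samples an unsafe state, the fine one does
assert all(np.linalg.norm(x) > unsafe_radius for x in coarse)
assert any(np.linalg.norm(x) <= unsafe_radius for x in fine)
```

With dt = 0.1 as below, consecutive states are close enough that the CBF condition checked at each step remains meaningful for the whole trajectory.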

In [11]:
import gymnasium as gym
from fosco.systems.gym_env.system_env import SystemEnv
from fosco.systems.gym_env.rewards import GoToUnsafeReward
from rl_trainer.wrappers.record_episode_statistics import RecordEpisodeStatistics

max_steps = 100
sim_dt = 0.1
num_envs = 3

def make_env(seed, render_mode=None):

    def thunk():
        env = SystemEnv(
            system=system,
            dt=sim_dt,
            max_steps=max_steps,
            reward_fn=GoToUnsafeReward(system=system),
            render_mode=render_mode
        )
        env.action_space.seed(seed)
        env = RecordEpisodeStatistics(env)
        env = gym.wrappers.NormalizeReward(env)
        env = gym.wrappers.TransformReward(env, lambda reward: np.clip(reward, -10, 10))
        return env

    return thunk
    
envs = gym.vector.SyncVectorEnv(
        [
            make_env(seed=seed) for i in range(num_envs)
        ]
    )

Define Safe Policy¶

In [18]:
from rl_trainer.ppo.ppo_config import PPOConfig

config = PPOConfig()
config.num_envs = num_envs
config.num_steps = 2048
config.total_timesteps = 200000
In [19]:
import plotly.graph_objects as go
from rl_trainer.safe_ppo.safeppo_trainer import SafePPOTrainer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
trainer = SafePPOTrainer(envs=envs, config=config, barrier=result.barrier, compensator=result.compensator, device=device)

Training¶

In [20]:
results = trainer.train(envs=envs, verbose=verbosity)
INFO:rl_trainer.ppo.ppo_trainer:iteration 1/32
INFO:rl_trainer.ppo.ppo_trainer:SPS: 183
INFO:rl_trainer.ppo.ppo_trainer:iteration 2/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 6144/200000 
	episodic returns: -532.93 +/- 97.55 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 176
INFO:rl_trainer.ppo.ppo_trainer:iteration 3/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 12288/200000 
	episodic returns: -490.32 +/- 112.82 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 174
INFO:rl_trainer.ppo.ppo_trainer:iteration 4/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 18432/200000 
	episodic returns: -487.87 +/- 64.57 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 163
INFO:rl_trainer.ppo.ppo_trainer:iteration 5/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 24576/200000 
	episodic returns: -424.07 +/- 86.58 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 163
INFO:rl_trainer.ppo.ppo_trainer:iteration 6/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 30720/200000 
	episodic returns: -438.14 +/- 97.04 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 159
INFO:rl_trainer.ppo.ppo_trainer:iteration 7/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 36864/200000 
	episodic returns: -380.73 +/- 58.51 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 157
INFO:rl_trainer.ppo.ppo_trainer:iteration 8/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 43008/200000 
	episodic returns: -361.75 +/- 54.25 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 154
INFO:rl_trainer.ppo.ppo_trainer:iteration 9/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 49152/200000 
	episodic returns: -370.12 +/- 73.51 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 152
INFO:rl_trainer.ppo.ppo_trainer:iteration 10/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 55296/200000 
	episodic returns: -314.30 +/- 41.00 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 150
INFO:rl_trainer.ppo.ppo_trainer:iteration 11/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 61440/200000 
	episodic returns: -293.47 +/- 35.19 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 148
INFO:rl_trainer.ppo.ppo_trainer:iteration 12/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 67584/200000 
	episodic returns: -267.14 +/- 32.23 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 149
INFO:rl_trainer.ppo.ppo_trainer:iteration 13/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 73728/200000 
	episodic returns: -271.40 +/- 28.44 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 147
INFO:rl_trainer.ppo.ppo_trainer:iteration 14/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 79872/200000 
	episodic returns: -263.61 +/- 23.42 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 147
INFO:rl_trainer.ppo.ppo_trainer:iteration 15/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 86016/200000 
	episodic returns: -251.17 +/- 16.95 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 148
INFO:rl_trainer.ppo.ppo_trainer:iteration 16/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 92160/200000 
	episodic returns: -256.43 +/- 20.48 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 146
INFO:rl_trainer.ppo.ppo_trainer:iteration 17/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 98304/200000 
	episodic returns: -250.46 +/- 17.90 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 147
INFO:rl_trainer.ppo.ppo_trainer:iteration 18/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 104448/200000 
	episodic returns: -243.88 +/- 23.26 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 145
INFO:rl_trainer.ppo.ppo_trainer:iteration 19/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 110592/200000 
	episodic returns: -234.85 +/- 15.38 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 146
INFO:rl_trainer.ppo.ppo_trainer:iteration 20/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 116736/200000 
	episodic returns: -231.38 +/- 19.23 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 145
INFO:rl_trainer.ppo.ppo_trainer:iteration 21/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 122880/200000 
	episodic returns: -231.78 +/- 11.90 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 144
INFO:rl_trainer.ppo.ppo_trainer:iteration 22/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 129024/200000 
	episodic returns: -238.40 +/- 17.69 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 141
INFO:rl_trainer.ppo.ppo_trainer:iteration 23/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 135168/200000 
	episodic returns: -233.04 +/- 14.28 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 24/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 141312/200000 
	episodic returns: -227.16 +/- 13.14 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 25/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 147456/200000 
	episodic returns: -236.40 +/- 12.52 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 26/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 153600/200000 
	episodic returns: -232.63 +/- 16.54 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 27/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 159744/200000 
	episodic returns: -222.71 +/- 12.68 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 28/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 165888/200000 
	episodic returns: -228.34 +/- 17.19 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 29/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 172032/200000 
	episodic returns: -226.56 +/- 9.05 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 143
INFO:rl_trainer.ppo.ppo_trainer:iteration 30/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 178176/200000 
	episodic returns: -232.41 +/- 14.47 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 143
INFO:rl_trainer.ppo.ppo_trainer:iteration 31/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 184320/200000 
	episodic returns: -228.50 +/- 10.80 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 143
INFO:rl_trainer.ppo.ppo_trainer:iteration 32/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: 
	global step: 190464/200000 
	episodic returns: -221.86 +/- 13.60 
	episodic costs: 0.00 +/- 0.00 
	episodic lengths: 100.00 +/- 0.00 

INFO:rl_trainer.ppo.ppo_trainer:SPS: 143
In [22]:
# process training metrics
train_steps = results["train_steps"]
train_returns = results["train_returns"]
train_costs = results["train_costs"]

# group by steps (possible duplicate in steps due to vectorization)
train_steps, train_returns, train_returns_std, train_costs, train_costs_std = zip(*sorted(
    [(step, 
      np.mean([r for r, s in zip(train_returns, train_steps) if s == step]), 
      np.std([r for r, s in zip(train_returns, train_steps) if s == step]),
      np.mean([c for c, s in zip(train_costs, train_steps) if s == step]),
      np.std([c for c, s in zip(train_costs, train_steps) if s == step])) for step in set(train_steps)]
))
In [23]:
# load baseline data for comparison
baseline_data = np.genfromtxt("data/results_nocbf.csv", delimiter=",", comments="#")[1:]

baseline_steps = baseline_data[:, 0]
baseline_returns = baseline_data[:, 1]
baseline_costs = baseline_data[:, 2]

# group by steps (possible duplicate in steps due to vectorization)
baseline_steps, baseline_returns, baseline_returns_std, baseline_costs, baseline_costs_std = zip(*sorted(
    [(step, 
      np.mean([r for r, s in zip(baseline_returns, baseline_steps) if s == step]), 
      np.std([r for r, s in zip(baseline_returns, baseline_steps) if s == step]),
      np.mean([c for c, s in zip(baseline_costs, baseline_steps) if s == step]),
      np.std([c for c, s in zip(baseline_costs, baseline_steps) if s == step])) for step in set(baseline_steps)]
))
In [25]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

axes[0].plot(train_steps, train_returns, label="Safe PPO")
axes[0].fill_between(train_steps, np.array(train_returns) - np.array(train_returns_std), np.array(train_returns) + np.array(train_returns_std), alpha=0.2)
axes[0].plot(baseline_steps, baseline_returns, label="PPO")
axes[0].fill_between(baseline_steps, 
                     np.array(baseline_returns) - np.array(baseline_returns_std), 
                     np.array(baseline_returns) + np.array(baseline_returns_std), alpha=0.2)
axes[0].set_xlabel("Steps")
axes[0].set_title("Episodic Return")
axes[0].legend()

axes[1].plot(train_steps, train_costs, label="Safe PPO")
axes[1].fill_between(train_steps, np.array(train_costs) - np.array(train_costs_std), np.array(train_costs) + np.array(train_costs_std), alpha=0.2)
axes[1].plot(baseline_steps, baseline_costs, label="PPO")
axes[1].fill_between(baseline_steps, np.array(baseline_costs) - np.array(baseline_costs_std), 
                     np.array(baseline_costs) + np.array(baseline_costs_std), alpha=0.2)
axes[1].set_xlabel("Steps")
axes[1].set_title("Episodic Cost")
axes[1].legend()

plt.show()
Figure
In [26]:
eval_envs = gym.vector.SyncVectorEnv([make_env(seed=seed, render_mode='rgb_array') for i in range(1)])
agent = trainer.get_actor()

obs, infos = eval_envs.reset()
done = False
frames = []
while not done:
    action = agent.get_action_and_value(torch.Tensor(obs).to(device))["action"]
    obs, reward, term, trunc, infos = eval_envs.step(action.detach().cpu().numpy())
    done = term[0] or trunc[0]
    frame = eval_envs.envs[0].render()
    frames.append(frame)
In [27]:
# save video
import imageio
imageio.mimsave("safe_robust_policy.mp4", frames, fps=10)


In [28]:
from IPython.display import Video

Video("safe_robust_policy.mp4")
Out[28]:
In [37]:
import time
t0 = time.time()

batch_size = 100
print(f"Seed {seed}")

env = SystemEnv(
    system=system,
    dt=sim_dt,
    max_steps=max_steps,
)
obs, info = env.reset(
    seed=seed, options={"batch_size": batch_size, "return_as_np": False}
)
terminations = truncations = np.zeros(batch_size, dtype=bool)

traj = {"x": [obs], "u": []}
while not (any(terminations) or any(truncations)):
    obs = obs[None] if len(obs.shape) == 1 else obs
    with torch.no_grad():
        u = agent.get_action_and_value(torch.Tensor(obs).to(device))["action"]
    u = u.cpu().numpy()

    obs, rewards, terminations, truncations, infos = env.step(u)

    traj["x"].append(obs)
    traj["u"].append(u)

print(f"Sim time: {time.time() - t0} seconds")
Seed 42
Sim time: 18.42903470993042 seconds
In [38]:
traj["x"] = np.array(traj["x"])
traj["u"] = np.array(traj["u"])

fig, ax1 = plt.subplots(1, 1, figsize=(10, 10))
for i in range(traj["x"].shape[1]):
    xs = traj["x"][:, i, 0]
    ys = traj["x"][:, i, 1]
    ax1.plot(xs, ys, color="blue")
    ax1.scatter(xs[0], ys[0], marker="x", color="k")

# draw circle unsafe set
cx, r = system.domains["unsafe"].center, system.domains["unsafe"].radius
ax1.plot(
    cx[0] + r * np.cos(np.linspace(0, 2 * np.pi, 25)),
    cx[1] + r * np.sin(np.linspace(0, 2 * np.pi, 25)),
    color="r",
    linestyle="dashed",
    label="obstacle",
)

ax1.set_title("Space Trajectories")
ax1.set_xlabel("x[0]")
ax1.set_ylabel("x[1]")
ax1.set_xlim(-5, +5)
ax1.set_ylim(-5, +5)
ax1.axis("equal")
ax1.invert_yaxis()
ax1.legend()
Out[38]:
<matplotlib.legend.Legend at 0x7f160c212e00>
Figure
In [ ]: